Abstract

We study the problem of decision-theoretic online learning (DTOL). Motivated 
by practical applications, we focus on DTOL when the number of actions is very 
large. Previous algorithms for learning in this framework have a tunable learning 
rate parameter, and a barrier to using online-learning in practical applications is 
that it is not understood how to set this parameter optimally, particularly when the 
number of actions is large. 

In this paper, we offer a clean solution by proposing a novel and completely 
parameter-free algorithm for DTOL. We introduce a new notion of regret, which 
is more natural for applications with a large number of actions. We show that our
algorithm achieves good performance with respect to this new notion of regret; in
addition, it also achieves performance close to that of the best bounds achieved
by previous algorithms with optimally-tuned parameters, according to previous
notions of regret.

1 Introduction

In this paper, we consider the problem of decision-theoretic online learning (DTOL), proposed by
Freund and Schapire [1]. DTOL is a variant of the problem of prediction with expert advice [2, 3].
In this problem, a learner must assign probabilities to a fixed set of actions in a sequence of rounds. 
After each assignment, each action incurs a loss (a value in [0, 1]); the learner incurs a loss equal 
to the expected loss of actions for that round, where the expectation is computed according to the 
learner's current probability assignment. The regret (of the learner) to an action is the difference 
between the learner's cumulative loss and the cumulative loss of that action. The goal of the learner 
is to achieve, on any sequence of losses, low regret to the action with the lowest cumulative loss (the 
best action). 

DTOL is a general framework that captures many learning problems of interest. For example,
consider tracking the hidden state of an object in a continuous state space from noisy observations [4].
To look at tracking in a DTOL framework, we set each action to be a path (sequence of states) over 
the state space. The loss of an action at time t is the distance between the observation at time t and 
the state of the action at time t, and the goal of the learner is to predict a path which has loss close 
to that of the action with the lowest cumulative loss. 

The most popular solution to the DTOL problem is the Hedge algorithm [1, 5]. In Hedge, each action
is assigned a probability, which depends on the cumulative loss of this action and a parameter $\eta$, also
called the learning rate. By appropriately setting the learning rate as a function of the iteration [6, 7]
and the number of actions, Hedge can achieve a regret upper-bounded by $O(\sqrt{T \ln N})$, for each
iteration $T$, where $N$ is the number of actions. This bound on the regret is optimal as there is an
$\Omega(\sqrt{T \ln N})$ lower bound [5].
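
To make the role of the learning rate concrete, here is a minimal sketch of Hedge in its standard exponential-weights form, where the probability of action $i$ is proportional to $\exp(-\eta L_i)$ and $L_i$ is its cumulative loss. The function names `hedge_weights` and `run_hedge` are illustrative, not taken from the paper, and the fixed-horizon tuning $\eta = \sqrt{8 \ln N / T}$ in the usage note is a textbook choice assumed for this example only.

```python
import math


def hedge_weights(cum_losses, eta):
    """Hedge distribution: p_i proportional to exp(-eta * L_i)."""
    # Subtracting the minimum loss rescales every weight by the same factor,
    # so the normalized distribution is unchanged but overflow is avoided.
    base = min(cum_losses)
    w = [math.exp(-eta * (loss - base)) for loss in cum_losses]
    total = sum(w)
    return [x / total for x in w]


def run_hedge(loss_rows, eta):
    """Run Hedge on loss_rows[t][i] in [0, 1]; return the regret to the best action."""
    n = len(loss_rows[0])
    cum = [0.0] * n
    learner_loss = 0.0
    for losses in loss_rows:
        p = hedge_weights(cum, eta)  # distribution chosen before seeing this round
        learner_loss += sum(pi * li for pi, li in zip(p, losses))
        cum = [c + l for c, l in zip(cum, losses)]
    return learner_loss - min(cum)
```

For example, `run_hedge(rows, math.sqrt(8 * math.log(N) / T))` runs the sketch with a learning rate tuned to a known horizon `T` and action count `N`. Both quantities enter the tuning explicitly, which is the dependence discussed in the text.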

In this paper, motivated by practical applications such as tracking, we consider DTOL in the regime
where the number of actions $N$ is very large. A major barrier to using online-learning for practical
problems is that when $N$ is large, it is not understood how to set the learning rate $\eta$. [1] suggest
setting $\eta$ as a fixed function of the number of actions $N$. However, this can lead to poor performance,
as we illustrate by an example in Section 3, and the degradation in performance is particularly
exacerbated as $N$ grows larger. One way to address this is by simultaneously running multiple
copies of Hedge with multiple values of the learning rate, and choosing the output of the copy
that performs the best in an online way. However, this solution is impractical for real applications,
particularly as $N$ is already very large. (For more details about these solutions, please see Section 4.)

Figure 1: A new notion of regret. Suppose each action is a point on a line, and the total losses are
as given in the plot. The regret to the top $\epsilon$-quantile is the difference between the learner's total loss
and the total loss of the worst action in the indicated interval of measure $\epsilon$.

In this paper, we take a step towards making online learning more practical by proposing a novel, 
completely adaptive algorithm for DTOL. Our algorithm is called NormalHedge. NormalHedge 
is very simple and easy to implement, and in each round, it simply involves a single line search, 
followed by an updating of weights for all actions. 

A second issue with using online-learning in problems such as tracking, where $N$ is very large, is
that the regret to the best action is not an effective measure of performance. For problems such as
tracking, one expects to have a lot of actions that are close to the action with the lowest loss. As
these actions also have low loss, measuring performance with respect to a small group of actions
that perform well is extremely reasonable; see, for example, Figure 1.

In this paper, we address this issue by introducing a new notion of regret, which is more natural
for practical applications. We order the cumulative losses of all actions from lowest to highest and
define the regret of the learner to the top $\epsilon$-quantile to be the difference between the cumulative loss
of the learner and the $\lfloor \epsilon N \rfloor$-th element in the sorted list.
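
The definition above can be sketched in a few lines. The helper name `regret_to_top_quantile` is hypothetical, and the sketch assumes $\epsilon \ge 1/N$ so that the $\lfloor \epsilon N \rfloor$-th element exists.

```python
import math


def regret_to_top_quantile(learner_loss, action_losses, eps):
    """Learner's cumulative loss minus the floor(eps * N)-th smallest action loss."""
    n = len(action_losses)
    k = math.floor(eps * n)
    if k < 1:
        raise ValueError("eps must be at least 1/N")
    # Sort cumulative losses from lowest to highest; index k - 1 is the k-th element.
    return learner_loss - sorted(action_losses)[k - 1]
```

With `eps = 1 / N` this reduces to the usual regret to the best action, and larger values of `eps` compare the learner against progressively weaker actions.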

We prove that for NormalHedge, the regret to the top $\epsilon$-quantile of actions is at most

$$O\left(\sqrt{T \ln(1/\epsilon)} + \ln^2 N\right),$$

which holds simultaneously for all $T$ and $\epsilon$. If we set $\epsilon = 1/N$, we get that the regret to the best
action is upper-bounded by $O\left(\sqrt{T \ln N} + \ln^2 N\right)$, which is only slightly worse than the bound
achieved by Hedge with optimally-tuned parameters. Notice that in our regret bound, the term
involving $T$ has no dependence on $N$. In contrast, Hedge cannot achieve a regret bound of this
nature uniformly for all $\epsilon$. (For details on how Hedge can be modified to perform with our new
notion of regret, see Section 4.)

NormalHedge works by assigning each action $i$ a potential; actions which have lower cumulative
loss than the algorithm are assigned a potential $\exp(R_{i,t}^2/(2c_t))$, where $R_{i,t}$ is the regret to action
$i$ and $c_t$ is an adaptive scale parameter, which is adjusted from one round to the next, depending
on the loss-sequences. Actions which have higher cumulative loss than the algorithm are assigned
potential 1. The weight assigned to an action in each round is then proportional to the derivative of its
potential. One can also interpret Hedge as a potential-based algorithm, and under this interpretation,
the potential assigned by Hedge to action $i$ is proportional to $\exp(\eta R_{i,t})$. This potential used by
Hedge differs significantly from the one we use. Although other potential-based methods have been
considered in the context of online learning [8], our potential function is novel and, to the best
of our knowledge, has not been studied in prior work. Our proof techniques are also different from
previous potential-based methods.

Initially: Set $R_{i,0} = 0$, $p_{i,1} = 1/N$ for each $i$.
For $t = 1, 2, \ldots$

1. Each action $i$ incurs loss $\ell_{i,t}$.

2. Learner incurs loss $\ell_{A,t} = \sum_{i=1}^N p_{i,t} \ell_{i,t}$.

3. Update cumulative regrets: $R_{i,t} = R_{i,t-1} + (\ell_{A,t} - \ell_{i,t})$ for each $i$.

4. Find $c_t > 0$ satisfying $\frac{1}{N} \sum_{i=1}^N \exp\left(\frac{([R_{i,t}]_+)^2}{2c_t}\right) = e$.

5. Update distribution for round $t+1$: $p_{i,t+1} \propto \frac{[R_{i,t}]_+}{c_t} \exp\left(\frac{([R_{i,t}]_+)^2}{2c_t}\right)$ for each $i$.

Figure 2: The Normal-Hedge algorithm.

Another useful property of NormalHedge, which Hedge does not possess, is that it assigns zero
weight to any action whose cumulative loss is larger than the cumulative loss of the algorithm
itself. In other words, non-zero weights are assigned only to actions which perform better than the
algorithm. In most applications, we expect a small set of the actions to perform significantly better 
than most of the actions. The regret of the algorithm is guaranteed to be small, which means that the 
algorithm will perform better than most of the actions and thus assign them zero probability. 

[9, 10] have proposed more recent solutions to DTOL in which the regret of Hedge to the best action
is upper bounded by a function of $L$, the loss of the best action, or by a function of the variations in
the losses. These bounds can be sharper than the bounds with respect to $T$. Our analysis (and in fact,
to our knowledge, any analysis based on potential functions in the style of [11, 8]) does not directly
yield these kinds of bounds. We therefore leave open the question of finding an adaptive algorithm
for DTOL which has regret upper-bounded by a function that depends on the loss of the best action.

The rest of the paper is organized as follows. In Section 2, we provide NormalHedge. In Section 
3, we provide an example that illustrates the suboptimality of standard online learning algorithms, 
when the parameter is not set properly. In Section 4, we discuss Related Work. In Section 5, we 
present some outlines of the proof. The proof details are in the Supplementary Materials. 



2 Algorithm 

2.1 Setting 

We consider the decision-theoretic framework for online learning. In this setting, the learner is given
access to a set of $N$ actions, where $N \ge 2$. In round $t$, the learner chooses a weight distribution
$p_t = (p_{1,t}, \ldots, p_{N,t})$ over the actions $1, 2, \ldots, N$. Each action $i$ incurs a loss $\ell_{i,t}$, and the learner
incurs the expected loss under this distribution:

$$\ell_{A,t} = \sum_{i=1}^N p_{i,t} \ell_{i,t}.$$

The learner's instantaneous regret to an action $i$ in round $t$ is $r_{i,t} = \ell_{A,t} - \ell_{i,t}$, and its (cumulative)
regret to an action $i$ in the first $t$ rounds is

$$R_{i,t} = \sum_{\tau=1}^t r_{i,\tau}.$$

We assume that the losses lie in an interval of length 1 (e.g. $[0, 1]$ or $[-1/2, 1/2]$; the sign of the
loss does not matter). The goal of the learner is to minimize this cumulative regret $R_{i,t}$ to any action
$i$ (in particular, the best action), for any value of $t$.
2.2 Normal-Hedge

Our algorithm, Normal-Hedge, is based on a potential function reminiscent of the half-normal
distribution, specifically

$$\phi(x, c) = \exp\left(\frac{([x]_+)^2}{2c}\right) \quad \text{for } x \in \mathbb{R},\ c > 0, \qquad (1)$$

where $[x]_+$ denotes $\max\{0, x\}$. It is easy to check that this function is separately convex in $x$ and $c$,
differentiable, and twice-differentiable except at $x = 0$.

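In addition to tracking the cumulative regrets $R_{i,t}$ to each action $i$ after each round $t$, the algorithm
also maintains a scale parameter $c_t$. This is chosen so that the average of the potential, over all
actions $i$, evaluated at $R_{i,t}$ and $c_t$, remains constant at $e$:

$$\frac{1}{N} \sum_{i=1}^N \exp\left(\frac{([R_{i,t}]_+)^2}{2c_t}\right) = e. \qquad (2)$$

We observe that since $\phi(x, c)$ is convex in $c > 0$, we can determine $c_t$ with a line search.

The weight assigned to $i$ in round $t$ is set proportional to the first derivative of the potential, evaluated
at $R_{i,t-1}$ and $c_{t-1}$:

$$p_{i,t} \propto \left.\frac{\partial}{\partial x} \phi(x, c)\right|_{x = R_{i,t-1},\ c = c_{t-1}} = \frac{[R_{i,t-1}]_+}{c_{t-1}} \exp\left(\frac{([R_{i,t-1}]_+)^2}{2c_{t-1}}\right).$$

Notice that the actions for which $R_{i,t-1} \le 0$ receive zero weight in round $t$.
We summarize the learning algorithm in Figure 2.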
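
The update can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the line search is done by bisection (using that the average potential decreases as $c$ grows), and when no action has positive regret, so that the scale equation has no solution, the sketch falls back to the uniform distribution.

```python
import math


def avg_potential(regrets, c):
    """Average of exp(([R_i]_+)^2 / (2c)) over all actions."""
    return sum(math.exp(max(r, 0.0) ** 2 / (2.0 * c)) for r in regrets) / len(regrets)


def find_scale(regrets):
    """Solve avg_potential(regrets, c) = e for c > 0 by bisection.

    Returns None if no regret is positive (the average is then 1 for every c).
    """
    m = max(regrets)
    if m <= 0.0:
        return None
    n = len(regrets)
    # At lo the largest term alone equals N * e, so the average is >= e.
    lo = m * m / (2.0 * (1.0 + math.log(n)))
    # At hi every term is <= e, so the average is <= e.
    hi = m * m / 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if avg_potential(regrets, mid) > math.e:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def normalhedge_weights(regrets, c):
    """Weights proportional to ([R_i]_+ / c) * exp(([R_i]_+)^2 / (2c))."""
    n = len(regrets)
    if c is None:
        return [1.0 / n] * n  # assumption of this sketch: uniform fallback
    w = [max(r, 0.0) / c * math.exp(max(r, 0.0) ** 2 / (2.0 * c)) for r in regrets]
    total = sum(w)
    return [x / total for x in w]


def run_normalhedge(loss_rows):
    """Run the sketch on loss_rows[t][i]; return (learner loss, final regrets)."""
    n = len(loss_rows[0])
    regrets = [0.0] * n
    p = [1.0 / n] * n
    learner_loss = 0.0
    for losses in loss_rows:
        round_loss = sum(pi * li for pi, li in zip(p, losses))
        learner_loss += round_loss
        regrets = [r + (round_loss - l) for r, l in zip(regrets, losses)]
        p = normalhedge_weights(regrets, find_scale(regrets))
    return learner_loss, regrets
```

Note that nothing in the sketch takes a learning rate or a horizon as input; the only quantity tuned per round is the scale returned by `find_scale`.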

3 An Illustrative Example 

In this section, we present an example to illustrate that setting the parameters of DTOL algorithms
as a function of $N$, the total number of actions, is suboptimal. To do this, we compare the
performance of NormalHedge with two representative algorithms: a version of Hedge due to [7], and the
Polynomial Weights algorithm, due to [12, 11]. Our experiments with this example indicate that the
performance of both these algorithms suffers because of the suboptimal setting of the parameters; on
the other hand, NormalHedge automatically adapts to the loss-sequences of the actions.

The main feature of our example is that the effective number of actions $n$ (i.e. the number of distinct
actions) is smaller than the total number of actions $N$. Notice that without prior knowledge of the
actions and their loss-sequences, one cannot determine the effective number of actions in advance; as a
result, there is no direct method by which Hedge and Polynomial Weights could set their parameters
as a function of $n$.

Our example attempts to model a practical scenario where one often finds multiple actions with
loss-sequences which are almost identical. For example, in the tracking problem, groups of paths
which are very close together in the state space will have very close loss-sequences. Our example
indicates that in this case, the performance of Hedge and Polynomial Weights will depend on
the discretization of the state space, whereas NormalHedge will be comparatively unaffected by such
discretization.

Our example has four parameters: $N$, the total number of actions; $n$, the effective number of actions
(the number of distinct actions); $k$, the (effective) number of good actions; and $\epsilon$, which indicates
how much better the good actions are compared to the rest. Finally, $T$ is the number of rounds.

The instantaneous losses of the $N$ actions are represented by an $N \times T$ matrix $B_N^{\epsilon,k}$; the loss of
action $i$ in round $t$ is the $(i, t)$-th entry in the matrix. The construction of the matrix is as follows.
First, we construct a (preliminary) $n \times T$ matrix $A_n$ based on the $2^d \times 2^d$ Hadamard matrix, where
$n = 2^{d+1} - 2$. This matrix $A_n$ is obtained from the $2^d \times 2^d$ Hadamard matrix by (1) deleting
the constant row, (2) stacking the remaining rows on top of their negations, (3) repeating each row
horizontally $T/2^d$ times, and finally, (4) halving the first column. We show $A_6$ for concreteness:

$$A_6 = \begin{bmatrix}
-1/2 & +1 & -1 & +1 & -1 & +1 & -1 & +1 & -1 & +1 & -1 & +1 & \cdots \\
-1/2 & -1 & +1 & +1 & -1 & -1 & +1 & +1 & -1 & -1 & +1 & +1 & \cdots \\
-1/2 & +1 & +1 & -1 & -1 & +1 & +1 & -1 & -1 & +1 & +1 & -1 & \cdots \\
+1/2 & -1 & +1 & -1 & +1 & -1 & +1 & -1 & +1 & -1 & +1 & -1 & \cdots \\
+1/2 & +1 & -1 & -1 & +1 & +1 & -1 & -1 & +1 & +1 & -1 & -1 & \cdots \\
+1/2 & -1 & -1 & +1 & +1 & -1 & -1 & +1 & +1 & -1 & -1 & +1 & \cdots
\end{bmatrix}$$

If the rows of $A_n$ give the losses for $n$ actions over time, then it is clear that on average, no action
is better than any other. Therefore for large enough $T$, for these losses, a typical algorithm will
eventually assign all actions the same weight. Now, let $A_n^{\epsilon,k}$ be the same as $A_n$ except that $\epsilon$ is
subtracted from each entry of the first $k$ rows, e.g.

$$A_6^{\epsilon,2} = \begin{bmatrix}
-1/2 - \epsilon & +1 - \epsilon & -1 - \epsilon & +1 - \epsilon & \cdots \\
-1/2 - \epsilon & -1 - \epsilon & +1 - \epsilon & +1 - \epsilon & \cdots \\
-1/2 & +1 & +1 & -1 & \cdots \\
+1/2 & -1 & +1 & -1 & \cdots \\
+1/2 & +1 & -1 & -1 & \cdots \\
+1/2 & -1 & -1 & +1 & \cdots
\end{bmatrix}$$

Now, when losses are given by $A_n^{\epsilon,k}$, the first $k$ actions (the good actions) perform better than the
remaining $n - k$; so, for large enough $T$, a typical algorithm will eventually recognize this and
assign the first $k$ actions equal weights (giving little or no weight to the remaining $n - k$). Finally,
we artificially replicate each action (each row) $N/n$ times to yield the final loss matrix $B_N^{\epsilon,k}$ for $N$
actions:

$$B_N^{\epsilon,k} = \left.\begin{bmatrix} A_n^{\epsilon,k} \\ A_n^{\epsilon,k} \\ \vdots \\ A_n^{\epsilon,k} \end{bmatrix}\right\} N/n \text{ replicates of } A_n^{\epsilon,k}.$$




The replication of actions significantly affects the behavior of algorithms that set parameters with 
respect to the number of actions N, which is inflated compared to the effective number of actions n. 
NormalHedge, having no such parameters, is completely unaffected by the replication of actions. 
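
The construction of the loss matrix can be sketched as follows. The function names are hypothetical, and the Sylvester recursion is only one of several ways to obtain a Hadamard matrix, assumed here for convenience. The negated rows are placed first so that, for $d = 2$, the output matches the displayed $A_6$; this row order is a choice of the sketch.

```python
def hadamard(d):
    """Sylvester construction of the 2^d x 2^d Hadamard matrix (+1/-1 entries)."""
    h = [[1]]
    for _ in range(d):
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def loss_matrix(d, T, k, eps, replicas):
    """Build the N x T loss matrix, with n = 2^(d+1) - 2 and N = n * replicas."""
    size = 2 ** d
    if T % size != 0:
        raise ValueError("T must be a multiple of 2^d")
    rows = hadamard(d)[1:]                                  # (1) delete the constant row
    rows = [[-x for x in r] for r in rows] + rows           # (2) stack rows with their negations
    a = [[float(x) for x in r] * (T // size) for r in rows]  # (3) repeat each row horizontally
    for r in a:
        r[0] *= 0.5                                         # (4) halve the first column
    for r in a[:k]:
        for t in range(T):
            r[t] -= eps                                     # first k rows are the good actions
    # Replicate the whole block so that every distinct action appears `replicas` times.
    return [list(r) for _ in range(replicas) for r in a]
```

For instance, `loss_matrix(2, 12, 2, 0.025, 3)` yields 18 rows of length 12, in which rows 0, 6 and 12 are identical copies of the first good action.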

We compare the performance of NormalHedge to two other representative algorithms, which we
call "Exp" and "Poly". Exp is a time/variation-adaptive version of Hedge (exponential weights)
due to [7] (roughly, $\eta_t = O(\sqrt{(\log N)/\mathrm{Var}_t})$, where $\mathrm{Var}_t$ is the cumulative loss variance). Poly
is polynomial weights [12, 11], which has a parameter $p$ that is typically set as a function of the
number of actions; we set $p = 2 \ln N$ as is recommended to guarantee a regret bound comparable to
that of Hedge.

Figure 3 shows the regrets to the best action versus the replication factor $N/n$, where the effective
number of actions n is held fixed. Recall that Exp and Poly have parameters set with respect to the 
number of actions N. 

We see from the figures that NormalHedge is completely unaffected by the replication of actions; 
no matter how many times the actions may be replicated, the performance of NormalHedge stays 
exactly the same. In contrast, increasing the replication factor affects the performance of Exp and 
Poly: Exp and Poly become more sensitive to the changes in the total losses of the actions (e.g. the
base of the exponent in the weights assigned by Exp increases with $N$); so when there are multiple
good actions (i.e. $k > 1$), Exp and Poly are slower to stabilize their weights over these good actions.
When $k = 1$, Exp and Poly actually perform better using the inflated value $N$ (as opposed to $n$), as
this causes the slight advantage of the single best action to be magnified. However, this particular
case is an anomaly; this does not happen even for $k = 2$. We note that if the parameters of Exp
and Poly were set to be a function of $n$, instead of $N$, then their performance would also
not depend on the replication factor (the performance would be the same as the $N/n = 1$ case).
Therefore, the degradation in performance of Exp and Poly is solely due to the suboptimality in
setting their parameters.

Figure 3: Regrets to the best action after $T = 32768$ rounds, versus replication factor $N/n$. Recall,
$k$ is the (effective) number of good actions. Here, we fix $n = 126$ and $\epsilon = 0.025$.



4 Related work 



There has been a large amount of literature on various aspects of DTOL. The Hedge algorithm of
[1] belongs to a more general family of algorithms, called the exponential weights algorithms; these
are originally based on Littlestone and Warmuth's Weighted Majority algorithm [2], and they have
been well-studied.

The standard measure of regret in most of these works is the regret to the best action. The original
Hedge algorithm has a regret bound of $O(\sqrt{T \log N})$. Hedge uses a fixed learning rate $\eta$ for all
iterations, and requires one to set $\eta$ as a function of the total number of iterations $T$. As a result,
its regret bound also holds only for a fixed $T$. The algorithm of [13] guarantees a regret bound
of $O(\sqrt{T \log N})$ to the best action uniformly for all $T$ by using a doubling trick. Time-varying
learning rates for exponential weights algorithms were considered in [6]; there, they show that if
$\eta_t = \sqrt{8 \ln(N)/t}$, then using exponential weights with $\eta = \eta_t$ in round $t$ guarantees regret bounds
of $\sqrt{2T \ln N} + O(\ln N)$ for any $T$. This bound provides a better regret to the best action than we
do. However, this method is still susceptible to poor performance, as illustrated in the example in
Section 3. Moreover, they do not consider our notion of regret.

Though not explicitly considered in previous works, the exponential weights algorithms can be
partly analyzed with respect to the regret to the top $\epsilon$-quantile. For any fixed $\epsilon$, Hedge can be
modified by setting $\eta$ as a function of this $\epsilon$ such that the regret to the top $\epsilon$-quantile is at most
$O(\sqrt{T \log(1/\epsilon)})$. The problem with this solution is that it requires the learning rate to be
set as a function of that particular $\epsilon$ (roughly $\eta = \sqrt{(\log 1/\epsilon)/T}$). Therefore, unlike our bound,
this bound does not hold uniformly for all $\epsilon$. One way to ensure a bound for all $\epsilon$ uniformly is to
run $\log N$ copies of Hedge, each with a learning rate set as a function of a different value of $\epsilon$. A
final master copy of the Hedge algorithm then looks at the probabilities given by these subordinate
copies to give the final probabilities. However, this procedure adds an additive $O(\sqrt{T \log \log N})$
term to the regret to the $\epsilon$-quantile of actions, for any $\epsilon$. More importantly, this procedure is also
impractical for real applications, where one might be already working with a large set of actions.
In contrast, our solution NormalHedge is clean and simple, and we guarantee a regret bound for all
values of $\epsilon$ uniformly, without any extra overhead.

More recent work in [14, 7, 10] provides algorithms with significantly improved bounds when the
total loss of the best action is small, or when the total variation in the losses is small. These bounds
do not explicitly depend on $T$, and thus can often be sharper than ones that do (including ours). We
stress, however, that these methods use a different notion of regret, and their learning rates depend
explicitly on $N$.

Besides exponential weights, another important class of online learning algorithms is the
polynomial weights algorithms studied in [12, 11, 8]. These algorithms too require a parameter; this
parameter does not depend on the number of rounds $T$, but depends crucially on the number of
actions $N$. The weight assigned to action $i$ in round $t$ is proportional to $([R_{i,t-1}]_+)^{p-1}$ for some $p > 1$;
setting $p = 2 \ln N$ yields regret bounds of the form $\sqrt{2eT(\ln N - 0.5)}$ for any $T$. Our algorithm
and polynomial weights share the feature that zero weight is given to actions that are performing
worse than the algorithm, although the degree of this weight sparsity is tied to the performance of
the algorithm. Finally, [15] derive a time-adaptive variation of the follow-the-(perturbed) leader
algorithm [16, 17] by scaling the perturbations by a parameter that depends on both $t$ and $N$.

5 Analysis 

5.1 Main results 

Our main result is the following theorem. 

Theorem 1. If Normal-Hedge has access to $N$ actions, then for all loss sequences, for all $t$, for all
$0 < \epsilon \le 1$ and for all $0 < \delta \le 1/2$, the regret of the algorithm to the top $\epsilon$-quantile of the actions is
at most

$$\sqrt{(1 + \ln(1/\epsilon)) \left( 3(1 + 50\delta)t + \frac{16 \ln^2 N}{\delta} \left( \frac{10.2}{\delta^2} + \ln N \right) \right)}.$$

In particular, with $\epsilon = 1/N$, the regret to the best action is at most

$$\sqrt{(1 + \ln N) \left( 3(1 + 50\delta)t + \frac{16 \ln^2 N}{\delta} \left( \frac{10.2}{\delta^2} + \ln N \right) \right)}.$$

The value $\delta$ in Theorem 1 appears to be an artifact of our analysis; we divide the sequence of rounds
into two phases (the length of the first is controlled by the value of $\delta$) and bound the behavior of
the algorithm in each phase separately. The following corollary illustrates the performance of our
algorithm for large values of $t$, in which case the effect of this first phase (and the $\delta$ in the bound)
essentially goes away.

Corollary 2. If Normal-Hedge has access to $N$ actions, then, as $t \to \infty$, the regret of Normal-Hedge
to the top $\epsilon$-quantile of actions approaches an upper bound of

$$\sqrt{3t(1 + \ln(1/\epsilon)) + o(t)}.$$

In particular, the regret of Normal-Hedge to the best action approaches an upper bound of

$$\sqrt{3t(1 + \ln N) + o(t)}.$$

The proof of Theorem 1 follows from a combination of Lemmas 3, 4, and 5, and is presented in
detail at the end of the current section.
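
To get a feel for how $\delta$ trades the leading term against the additive term, the following sketch evaluates the bound numerically. It assumes the Theorem 1 bound has the form $\sqrt{(1+\ln(1/\epsilon))\,(3(1+50\delta)t + (16\ln^2 N/\delta)(10.2/\delta^2 + \ln N))}$; the function names and the grid search over $\delta$ are illustrative choices, not part of the paper.

```python
import math


def theorem1_bound(N, t, eps, delta):
    """Evaluate the assumed Theorem 1 bound for given N, t, eps and delta."""
    ln_n = math.log(N)
    lead = 3.0 * (1.0 + 50.0 * delta) * t
    const = 16.0 * ln_n ** 2 / delta * (10.2 / delta ** 2 + ln_n)
    return math.sqrt((1.0 + math.log(1.0 / eps)) * (lead + const))


def best_delta(N, t, eps, grid=1000):
    """Pick the delta in (0, 1/2] that minimizes the bound on a uniform grid."""
    deltas = [0.5 * (j + 1) / grid for j in range(grid)]
    return min(deltas, key=lambda d: theorem1_bound(N, t, eps, d))
```

Comparing `theorem1_bound(N, t, eps, best_delta(N, t, eps))` with `math.sqrt(3 * t * (1 + math.log(1 / eps)))` for growing `t` illustrates the asymptotic statement of Corollary 2: the ratio of the two quantities moves toward 1 as the best `delta` shrinks.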

5.2 Regret bounds from the potential equation 

The following lemma relates the performance of the algorithm at time $t$ to the scale $c_t$.

Lemma 3. At any time $t$, the regret to the best action can be bounded as:

$$\max_i R_{i,t} \le \sqrt{2c_t(\ln N + 1)}.$$

Moreover, for any $0 < \epsilon \le 1$ and any $t$, the regret to the top $\epsilon$-quantile of actions is at most

$$\sqrt{2c_t(\ln(1/\epsilon) + 1)}.$$

Proof. We use $E_t$ to denote the actions that have non-zero weight on iteration $t$. The first part of the
lemma follows from the fact that, for any action $i \in E_t$,

$$\exp\left(\frac{R_{i,t}^2}{2c_t}\right) = \exp\left(\frac{([R_{i,t}]_+)^2}{2c_t}\right) \le \sum_{i'=1}^N \exp\left(\frac{([R_{i',t}]_+)^2}{2c_t}\right) \le Ne,$$

which implies $R_{i,t} \le \sqrt{2c_t(\ln N + 1)}$.

For the second part of the lemma, let $R_{i,t}$ denote the regret of our algorithm to the action with the
$\epsilon N$-th highest regret. Then, the total potential of the actions with regrets greater than or equal to
$R_{i,t}$ is at least:

$$\epsilon N \exp\left(\frac{([R_{i,t}]_+)^2}{2c_t}\right) \le Ne,$$

from which the second part of the lemma follows. □
5.3 Bounds on the scale $c_t$ and proof of Theorem 1

In Lemmas 4 and 5, we bound the growth of the scale $c_t$ as a function of the time $t$.

The main outline of the proof of Theorem 1 is as follows. As $c_t$ increases monotonically with $t$, we
can divide the rounds $t$ into two phases, $t < t_0$ and $t \ge t_0$, where $t_0$ is the first time such that

$$c_{t_0} \ge \frac{4 \ln^2 N}{\delta} + \frac{16 \ln N}{\delta^3},$$

for some fixed $\delta \in (0, 1/2)$. We then show bounds on the growth of $c_t$ for each phase separately.
Lemma 4 shows that $c_t$ is not too large at the end of the first phase, while Lemma 5 bounds the
per-round growth of $c_t$ in the second phase.

Lemma 4. For any time $t$,

$$c_{t+1} \le 2c_t(1 + \ln N) + 3.$$

Lemma 5. Suppose that at some time $t_0$, $c_{t_0} \ge \frac{4 \ln^2 N}{\delta} + \frac{16 \ln N}{\delta^3}$, where $0 < \delta \le 1/2$ is a constant.
Then, for any time $t \ge t_0$,

$$c_{t+1} - c_t \le \frac{3}{2}(1 + 49.19\delta).$$

We now combine Lemmas 4 and 5 together with Lemma 3 to prove the main theorem.

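Proof of Theorem 1. Let $t_0$ be the first time at which $c_{t_0} \ge \frac{4 \ln^2 N}{\delta} + \frac{16 \ln N}{\delta^3}$. Then, from Lemma 4,

$$c_{t_0} \le 2c_{t_0 - 1}(1 + \ln N) + 3,$$

which is at most:

$$\frac{8 \ln^3 N}{\delta} + \frac{34 \ln^2 N}{\delta^3} + \frac{32 \ln N}{\delta^3} + 3 \le \frac{8 \ln^3 N}{\delta} + \frac{81 \ln^2 N}{\delta^3}.$$

The last inequality follows because $N \ge 2$ and $\delta \le 1/2$. By Lemma 5, we have that for any $t \ge t_0$,

$$c_t \le \frac{3}{2}(1 + 49.19\delta)(t - t_0) + c_{t_0}.$$

Combining these last two inequalities yields

$$c_t \le \frac{3}{2}(1 + 49.19\delta)t + \frac{8 \ln^3 N}{\delta} + \frac{81 \ln^2 N}{\delta^3}.$$

Now the theorem follows by applying Lemma 3. □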
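6 Remaining proofs

6.1 Proof of Lemma 4

Proof of Lemma 4. To show Lemma 4, we first show that, for any $t$,

$$\frac{1}{N} \sum_{i=1}^N \phi(R_{i,t+1},\ 2c_t(1 + \ln N) + 3) \le e. \qquad (3)$$

For any $i$, $R_{i,t+1} \le R_{i,t} + 1$, so the left-hand side of the above inequality is at most

$$\frac{1}{N} \sum_{i=1}^N \exp\left(\frac{([R_{i,t}]_+ + 1)^2}{4c_t(1 + \ln N) + 6}\right).$$

This, in turn, can be upper bounded by

$$\frac{1}{N} \sum_{i=1}^N \exp\left(\frac{([R_{i,t}]_+)^2}{4c_t(1 + \ln N) + 6}\right) \cdot \exp\left(\frac{2[R_{i,t}]_+}{4c_t(1 + \ln N) + 6}\right) \cdot \exp\left(\frac{1}{4c_t(1 + \ln N) + 6}\right).$$

We now bound each term in this product. First, we note that using Lemma 3, the first term can
be bounded as

$$\exp\left(\frac{([R_{i,t}]_+)^2}{4c_t(1 + \ln N) + 6}\right) \le \exp\left(\frac{2c_t(1 + \ln N)}{4c_t(1 + \ln N) + 6}\right) \le e^{1/2}.$$

The second term can be bounded as

$$\exp\left(\frac{2[R_{i,t}]_+}{4c_t(1 + \ln N) + 6}\right) \le \exp\left(\frac{2\sqrt{2c_t(1 + \ln N)}}{4c_t(1 + \ln N) + 6}\right) \le e^{1/(2\sqrt{3})}.$$

The last inequality follows by noticing that $2a + 6/a \ge 4\sqrt{3}$ for any $a > 0$, and in particular for
$a = \sqrt{2c_t(1 + \ln N)}$. Finally, the third term is trivially bounded by $e^{1/6}$. Combining the bounds
for the three terms, and noting that $1/2 + 1/(2\sqrt{3}) + 1/6 \le 1$, gives

$$\frac{1}{N} \sum_{i=1}^N \phi(R_{i,t+1},\ 2c_t(1 + \ln N) + 3) \le e.$$

Since the quantity $\phi(R_{i,t+1}, c)$ is non-increasing in $c$ and the average potential at $c_{t+1}$ equals $e$,
Equation (3) implies that $c_{t+1} \le 2c_t(1 + \ln N) + 3$. The lemma follows. □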

6.2 A bootstrap for Lemma 5

Before we can prove Lemma 5, we first show a somewhat weaker bound on the growth of $c_t$ with $t$
(Lemma 6); this bound is used in the proof of Lemma 5, which concludes with the tighter bound on
$c_{t+1} - c_t$.

Lemma 6. Suppose that at some time to, c to > 16 1§ , where 0<5<l/2isa constant. Then, for 
any time t > t , 

e 5 a + S + lnN) 
c t+1 -c t < T - ¥? . 



The main idea behind the proof of Lemma 6 is as follows. First, we use Lemma 7 to show that
$c_t$ is monotonic in $t$, and to get an expression for $c_{t+1} - c_t$ as a ratio of some derivatives and
double derivatives of the potential function $\phi$. Next, we use Lemma 8 and Corollary 10 to bound the
numerator and denominator of this ratio. Combining these bounds gives us a proof of Lemma 6.

We denote by $E_{t,t+1} = E_t \cup E_{t+1}$ the actions relevant to the change of potential between iterations
$t$ and $t + 1$ (recall, $E_t$ are the actions with non-zero weight on iteration $t$).
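Lemma 7. At any time $t$,

$$c_{t+1} - c_t \ge 0.$$

Moreover, $c_{t+1} - c_t$ is at most:

$$\frac{\sum_{i \in E_{t,t+1}} \left(\frac{1}{2c_t} + \frac{\rho_{i,t}^2}{2c_t^2}\right) \exp\left(\frac{\rho_{i,t}^2}{2c_t}\right) r_{i,t+1}^2}{\sum_{i \in E_{t+1}} \frac{R_{i,t+1}^2}{2\xi_{t+1}^2} \exp\left(\frac{R_{i,t+1}^2}{2\xi_{t+1}}\right)},$$

where $\xi_{t+1}$ lies in between $c_t$ and $c_{t+1}$ and for each $i$, $\rho_{i,t}$ lies between $R_{i,t}$ and $R_{i,t+1}$.

Proof. We consider the change in average potential due to the regrets changing from $R_{i,t}$ to $R_{i,t+1}$
(with the scale fixed at $c_t$), and then the change due to the scale changing from $c_t$ to $c_{t+1}$:

$$0 = \sum_{i=1}^N \phi(R_{i,t+1}, c_{t+1}) - \sum_{i=1}^N \phi(R_{i,t}, c_t)$$

$$= \sum_{i=1}^N \left[\phi(R_{i,t+1}, c_t) - \phi(R_{i,t}, c_t)\right] \qquad (4)$$

$$+ \sum_{i=1}^N \left[\phi(R_{i,t+1}, c_{t+1}) - \phi(R_{i,t+1}, c_t)\right]. \qquad (5)$$

It is clear that the sum in (4) can be restricted to $i \in E_{t,t+1}$, and that the sum in (5) can be restricted
to $i \in E_{t+1}$. We now will express (5) in terms of $c_{t+1} - c_t$ and employ upper and lower bounds on
(4).

First, we derive bounds on (4). Let $\psi(x) = \exp(x^2/(2c_t))$. Then $f(x) = \phi(x, c_t)$ can be written as

$$f(x) = \begin{cases} \psi(x) & \text{if } x \ge 0 \\ \psi(0) & \text{if } x < 0. \end{cases}$$

The function $f$ satisfies the preconditions of Lemma 11 (deferred to the end of the section), so we
have

$$f(R_{i,t+1}) - f(R_{i,t}) \ge f'(R_{i,t}) r_{i,t+1} \qquad (6)$$

and

$$f(R_{i,t+1}) - f(R_{i,t}) \le f'(R_{i,t}) r_{i,t+1} + \frac{\psi''(\rho_{i,t})}{2} r_{i,t+1}^2 \qquad (7)$$

where $\min\{R_{i,t}, R_{i,t+1}\} \le \rho_{i,t} \le \max\{R_{i,t}, R_{i,t+1}\}$ and $r_{i,t+1} = R_{i,t+1} - R_{i,t}$. Now we sum
both the lower and upper bounds (Eqs. (6) and (7)) over $i \in E_{t,t+1}$ and apply the fact

$$\sum_{i \in E_{t,t+1}} f'(R_{i,t}) r_{i,t+1} = \sum_{i=1}^N f'(R_{i,t}) r_{i,t+1} = 0,$$

which follows easily from the fact that the weight assigned to an action $i$ in trial $t + 1$ is proportional
to $f'(R_{i,t})$. Thus,

$$0 \le \sum_{i \in E_{t,t+1}} \left[\phi(R_{i,t+1}, c_t) - \phi(R_{i,t}, c_t)\right] \qquad (8)$$

$$\le \sum_{i \in E_{t,t+1}} \frac{\psi''(\rho_{i,t})}{2} r_{i,t+1}^2, \qquad (9)$$

where $\psi''(x)/2 = \left(\frac{1}{2c_t} + \frac{x^2}{2c_t^2}\right) \exp\left(\frac{x^2}{2c_t}\right)$.

To deal with (5), we view it as a function of $c_{t+1}$ and equate it via Taylor's theorem to a
first-order expansion around $c_t$:

$$\sum_{i \in E_{t+1}} \left[\phi(R_{i,t+1}, c_{t+1}) - \phi(R_{i,t+1}, c_t)\right] = -(c_{t+1} - c_t) \sum_{i \in E_{t+1}} \frac{R_{i,t+1}^2}{2\xi_{t+1}^2} \exp\left(\frac{R_{i,t+1}^2}{2\xi_{t+1}}\right)$$

for some $\xi_{t+1}$ between $c_t$ and $c_{t+1}$. Substituting this back into (5), we have

$$\sum_{i \in E_{t,t+1}} \left[\phi(R_{i,t+1}, c_t) - \phi(R_{i,t}, c_t)\right] = (c_{t+1} - c_t) \sum_{i \in E_{t+1}} \frac{R_{i,t+1}^2}{2\xi_{t+1}^2} \exp\left(\frac{R_{i,t+1}^2}{2\xi_{t+1}}\right). \qquad (10)$$

The summation on the right-hand side is non-negative, as is the summation on the left-hand side
(recall the lower bound (8)), so $c_{t+1} - c_t$ is non-negative as well. This shows the first part of the
lemma. The second part follows from re-arranging Eq. (10) and applying the upper bound (9). □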

Lemma 8. Let, for some t = to, c to > 16 ^" — + \,for some < 6 < 1. Then, for any t > to, 

pL ( pit x 



D gexp g <e^ + l + lniV)iV e 
where the pn are the values introduced in Lemma\7\ 

Proof. Pick any i <G E t ,t+i- If #i,t > 0, then |p i;t | < R ht + 1 = [Ri,t]+ + 1- Otherwise 
Ri,t+i > > i? iit . But since |r ljt+ i| = |i^,t + i -Ri,t\ < L it must be that \p i>t \ < 1 = [i?i jt ] + + l. 
Therefore 

pf, t < ([i?d+ + l) 2 , 

which in turn implies 

exp ^- J < exp { j = exp ■ exp j ■ exp 

To prove the first claim, it suffices to show that each of the two exponentials in the final product 
is bounded by e s/2 . Since c t > c to = (16 In N)/S 2 + 1/5, we have cxp(l/(2c t )) < e 5 / 2 . Also, 
Lemma0 imply 



exp < exp NEHMl) < cxp ^ < e W 

\ ct J \ V^t J \ 4VlrjJV y 

so the first claim follows. 

To prove the second claim, we use the first claim to derive the fact 



max < In exp [ ^— | <<5+l + lnA r 

'6B s , t+1 2c t ~ \2c t I ~ 



which in turn is combined again with the first claim to arrive at 



»eB t ,t+i \ / \ / »6B*,t+i V / 



completing the proof. □ 
Lemma 9. Let $B \ge 1$. If $\sum_{i=1}^N e^{x_i} \ge BN$ for some $x \ge 0$, then $\sum_{i=1}^N x_i e^{x_i} \ge BN \ln B$.

Proof. We consider minimizing $f(x) = \sum_{i=1}^N x_i e^{x_i}$ under the constraint $\sum_{i=1}^N e^{x_i} \ge BN$. Define
the Lagrangian function $L(x, \lambda) = \sum_{i=1}^N x_i e^{x_i} + \lambda\left(BN - \sum_{i=1}^N e^{x_i}\right)$. Then $(\partial/\partial x_i) L(x, \lambda) =
(x_i + 1 - \lambda) e^{x_i}$, which is $0$ when $x_i = \lambda - 1$. Let $g(\lambda) = L(x^*, \lambda)$ be the dual function, where
$x_i^* = \lambda - 1$. Then $g$ is maximized when $\lambda = 1 + \ln B$. By weak duality, $\sup_\lambda g(\lambda) \le \inf_x f(x)$
(with the constraints on $x$), so $f(x) \ge g(1 + \ln B) = BN \ln B$. □
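
As a quick sanity check of the inequality in Lemma 9, the sketch below draws random non-negative vectors, sets $B$ to the average of $e^{x_i}$ (the largest $B$ for which the hypothesis holds), and evaluates the gap between the two sides. The helper `lemma9_gap`, the sampling range and the seed are arbitrary choices made for this illustration; a numerical check of this kind is of course no substitute for the proof.

```python
import math
import random


def lemma9_gap(xs):
    """Return sum_i x_i e^{x_i} - B N ln B, with B the average of e^{x_i}."""
    n = len(xs)
    b = sum(math.exp(x) for x in xs) / n
    lhs = sum(x * math.exp(x) for x in xs)
    return lhs - b * n * math.log(b)


rng = random.Random(0)  # fixed seed so the check is reproducible
gaps = [lemma9_gap([rng.uniform(0.0, 3.0) for _ in range(20)]) for _ in range(1000)]
smallest_gap = min(gaps)  # Lemma 9 predicts this is non-negative
```

When all coordinates are equal, the two sides coincide, which matches the stationary point $x_i = \lambda - 1$ found in the proof.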

The lemma above leads to the following corollary. 
Corollary 10. For any $t$,

$$\sum_{i \in E_{t+1}} \frac{R_{i,t+1}^2}{2c_{t+1}} \exp\left(\frac{R_{i,t+1}^2}{2c_{t+1}}\right) \ge Ne.$$

Proof. Let $x_i = ([R_{i,t+1}]_+)^2/(2c_{t+1})$, so we have $\sum_{i=1}^N \exp(x_i) = Ne$ by Equation (2). Now
Lemma 9 implies $\sum_{i=1}^N x_i \exp(x_i) \ge Ne$. The corollary follows since $[R_{i,t+1}]_+ = 0$ whenever
$i \notin E_{t+1}$. □


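Now we prove Lemma 6.

Proof of Lemma 6. By Lemma 7 and the fact $|r_{i,t+1}| \le 1$, we have

$$0 \le c_{t+1} - c_t \le \frac{\sum_{i \in E_{t,t+1}} \left(\frac{1}{2c_t} + \frac{\rho_{i,t}^2}{2c_t^2}\right) \exp\left(\frac{\rho_{i,t}^2}{2c_t}\right)}{\sum_{i \in E_{t+1}} \frac{R_{i,t+1}^2}{2c_{t+1}^2} \exp\left(\frac{R_{i,t+1}^2}{2c_{t+1}}\right)}.$$

Combining this with Lemma 8 and Corollary 10, we have

$$c_{t+1} - c_t \le \frac{c_{t+1}}{c_t} \cdot e^{\delta}\left(\frac{3}{2} + \delta + \ln N\right) = \left(1 + \frac{c_{t+1} - c_t}{c_t}\right) \cdot e^{\delta}\left(\frac{3}{2} + \delta + \ln N\right).$$

Re-arranging, and using the fact that $c_t \ge c_{t_0} \ge (16 \ln N)/\delta^2$, we get that

$$c_{t+1} - c_t \le \frac{e^{\delta}\left(\frac{3}{2} + \delta + \ln N\right)}{1 - \frac{1}{c_t} e^{\delta}\left(\frac{3}{2} + \delta + \ln N\right)} \le \frac{e^{\delta}\left(\frac{3}{2} + \delta + \ln N\right)}{1 - e^{\delta}\left(\frac{3\delta^2}{32 \ln N} + \frac{\delta^3}{16 \ln N} + \frac{\delta^2}{16}\right)}.$$

The rest of the lemma follows by plugging in the fact that $N \ge 2$ and $\delta \le 1/2$. □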



6.3 Proof of Lemma 5

Finally, we are ready to prove Lemma 5. As in the proof of Lemma 6, here too we start with an
upper bound on $c_{t+1} - c_t$, obtained from Lemma 7. We then use this upper bound and the bound
in Lemma 6 to get a finer bound on the quantity $c_{t+1} - c_t$.



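Proof of Lemma 5. We divide the actions into two sets:

\[
S_1 = \{ i \in E_{t,t+1} : [R_{i,t+1}]_+ + 1 \le \sqrt{2c_t\delta} \},
\]
\[
S_2 = \{ i \in E_{t,t+1} : [R_{i,t+1}]_+ + 1 > \sqrt{2c_t\delta} \}.
\]

Using the fact that $|r_{i,t+1}| \le 1$, the bound from Lemma 7 can be written as

\[
c_{t+1} - c_t \le
\frac{\frac{1}{2c_t} \sum_{i \in E_{t,t+1}} \exp\left( \frac{\rho_{i,t}^2}{2c_t} \right)}
{\sum_{i \in E_{t+1}} \frac{R_{i,t+1}^2}{2c_{t+1}^2} \exp\left( \frac{R_{i,t+1}^2}{2c_{t+1}} \right)}
+ \frac{\sum_{i \in S_1} \frac{\rho_{i,t}^2}{2c_t^2} \exp\left( \frac{\rho_{i,t}^2}{2c_t} \right)}
{\sum_{i \in E_{t+1}} \frac{R_{i,t+1}^2}{2c_{t+1}^2} \exp\left( \frac{R_{i,t+1}^2}{2c_{t+1}} \right)}
+ \frac{\sum_{i \in S_2} \frac{\rho_{i,t}^2}{2c_t^2} \exp\left( \frac{\rho_{i,t}^2}{2c_t} \right)}
{\sum_{i \in E_{t+1}} \frac{R_{i,t+1}^2}{2c_{t+1}^2} \exp\left( \frac{R_{i,t+1}^2}{2c_{t+1}} \right)}.
\]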

We will upper-bound each of these three terms separately. 

The first term is bounded by $(c_{t+1}/c_t) e^{\delta}/2$ using Lemma 8 and Corollary 10.
To bound the second term, the definition of $S_1$ implies

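\[
\frac{\rho_{i,t}^2}{2c_t^2} \exp\left( \frac{\rho_{i,t}^2}{2c_t} \right) \le \frac{2c_t\delta}{2c_t^2} \exp\left( \frac{2c_t\delta}{2c_t} \right).
\]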

Now Corollary 10 implies a bound of $(c_{t+1}/c_t)(\delta e^{\delta}/e)$.



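Now we bound the third term. Note that since $\sqrt{2c_t\delta} \ge \sqrt{2c_{t_0}\delta} \ge 1$, each $i \in S_2$
is also in $E_{t+1}$. So the third term is bounded above by the largest ratio

\[
\frac{\frac{\rho_{i,t}^2}{2c_t^2} \exp\left( \frac{\rho_{i,t}^2}{2c_t} \right)}
{\frac{R_{i,t+1}^2}{2c_{t+1}^2} \exp\left( \frac{R_{i,t+1}^2}{2c_{t+1}} \right)}
\]

over all $i \in E_{t+1}$. Since $|\rho_{i,t}| \le R_{i,t+1} + 1$, each such ratio is at most

\[
\frac{c_{t+1}^2}{c_t^2} \cdot \left( \frac{R_{i,t+1} + 1}{R_{i,t+1}} \right)^2 \cdot \exp\left( \frac{1}{2c_t} + \frac{R_{i,t+1}}{c_t} \right) \cdot \exp\left( \frac{(R_{i,t+1})^2 (c_{t+1} - c_t)}{2 c_t c_{t+1}} \right).
\]

We bound each factor in this product (deferring $c_{t+1}^2/c_t^2$ until later).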

• As $R_{i,t+1} + 1 > \sqrt{2c_t\delta}$ and $c_t \ge c_{t_0} \ge 10/\delta^3$, we have $\sqrt{2c_t\delta} \ge 1/\delta$ and

\[
\left( \frac{R_{i,t+1} + 1}{R_{i,t+1}} \right)^2 \le \frac{1}{(1 - \delta)^2}.
\]

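• By Lemma 3 we have

\[
R_{i,t+1} \le R_{i,t} + 1 \le \sqrt{2c_t(1 + \ln N)} + 1,
\]

so

\[
\exp\left( \frac{1}{2c_t} + \frac{R_{i,t+1}}{c_t} \right) \le \exp\left( \frac{3}{2c_t} + \frac{R_{i,t}}{c_t} \right) \le \exp\left( 0.6\delta + 0.15\delta^2 \right)
\]

since $c_t \ge c_{t_0} \ge (16 \ln N)/\delta^2 \ge 10/\delta^2$.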

• We use Lemma 6 with $c_t \ge (16 \ln N)/\delta^2$ and $\delta \le 1/2$ to obtain the crude bound

\[
c_{t+1} - c_t \le e^{\delta}(4 + 2 \ln N). \tag{11}
\]

Now using this bound along with Lemma 3 and $\delta \le 1/2$ gives

\[
\frac{(R_{i,t+1})^2 (c_{t+1} - c_t)}{2 c_t c_{t+1}} \le \frac{e^{\delta}(4 + 2 \ln N)(1 + \ln N)}{c_t} \le \frac{6.2 + 4.7 \ln N + 3.1 \ln^2 N}{c_t},
\]

which is at most $\delta$ since $c_{t_0} \ge (16 \ln N)/\delta^2 + (4 \ln^2 N)/\delta$.
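Therefore, the third term is bounded by

\[
\frac{c_{t+1}^2}{c_t^2} \cdot \frac{\exp(1.6\delta + 0.15\delta^2)}{(1 - \delta)^2} \le \frac{c_{t+1}^2}{c_t^2} \cdot \frac{e^{2\delta}}{(1 - \delta)^2}.
\]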

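Collecting the three terms in the bound for $c_{t+1} - c_t$,

\[
c_{t+1} - c_t \le \frac{c_{t+1}}{c_t} \cdot \frac{e^{\delta}}{2} + \frac{c_{t+1}}{c_t} \cdot \frac{\delta e^{\delta}}{e} + \frac{c_{t+1}^2}{c_t^2} \cdot \frac{e^{2\delta}}{(1 - \delta)^2}.
\]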
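We bound the ratio $c_{t+1}/c_t$ as

\[
\frac{c_{t+1}}{c_t} \le 1 + \frac{c_{t+1} - c_t}{c_t} \le 1 + \frac{e^{\delta}(4 + 2 \ln N)}{c_t} \le 1 + \frac{4\delta^2 e^{\delta}}{10} + \frac{\delta^2 e^{\delta}}{8} = 1 + \frac{21 \delta^2 e^{\delta}}{40} \le 1 + \delta,
\]

where we have used the bound in (11), $c_{t_0} \ge (16 \ln N)/\delta^2$, and $\delta \le 1/2$. Therefore, we have

\[
c_{t+1} - c_t \le \frac{1}{2} e^{\delta}(1 + \delta) + \frac{\delta e^{\delta}}{e}(1 + \delta) + \frac{e^{2\delta}(1 + \delta)^2}{(1 - \delta)^2}.
\]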

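To finish the proof, we use the fact that $\delta \le \frac{1}{2}$. Using this condition and a Taylor series expansion,
when $0 \le \delta \le \frac{1}{2}$, $e^{\delta} \le 1 + \sqrt{e}\,\delta \le 1 + 1.65\delta$. Using this fact,

\[
\frac{1}{2} e^{\delta}(1 + \delta) + \frac{\delta e^{\delta}}{e}(1 + \delta)
\]

is at most

\[
\frac{1}{2} + 3.02\delta + 2.63\delta^2 + 0.61\delta^3,
\]

which in turn is at most $0.5 + 3.49\delta$. Moreover,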



\[
\frac{e^{2\delta}(1 + \delta)^2}{(1 - \delta)^2} \le e^{2\delta}(1 + 4\delta)^2,
\]

which again is at most

\[
(1 + 3.3\delta)(1 + 4\delta)^2.
\]

Using the fact that $\delta \le \frac{1}{2}$, this is at most $1 + 45.7\delta$. The lemma follows by combining this with the
bound in the previous paragraph. □
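
Several of the elementary inequalities used in this proof are easy to confirm on a grid. The Python sketch below is an illustration only and does not verify every step of the argument. It checks, for $\delta \in [0, 1/2]$, that $e^{\delta} \le 1 + \sqrt{e}\,\delta$, that $(1+\delta)/(1-\delta) \le 1 + 4\delta$, that $1 + 21\delta^2 e^{\delta}/40 \le 1 + \delta$, and that $(1 + 3.3\delta)(1 + 4\delta)^2 \le 1 + 45.7\delta$. The function name `check_constants` and the grid size are assumptions made for this example.

```python
import math

def check_constants(num_points=501):
    """Return the largest value of (lhs - rhs) over a grid of delta in [0, 1/2].

    A result that is <= 0 (up to rounding) means all four inequalities hold on the grid.
    """
    worst = float("-inf")
    for k in range(num_points):
        d = 0.5 * k / (num_points - 1)
        slacks = [
            math.exp(d) - (1.0 + math.sqrt(math.e) * d),           # e^d <= 1 + sqrt(e) d
            (1.0 + d) / (1.0 - d) - (1.0 + 4.0 * d),                # (1+d)/(1-d) <= 1 + 4d
            (1.0 + 21.0 * d * d * math.exp(d) / 40.0) - (1.0 + d),  # ratio bound <= 1 + d
            (1.0 + 3.3 * d) * (1.0 + 4.0 * d) ** 2 - (1.0 + 45.7 * d),
        ]
        worst = max(worst, max(slacks))
    return worst

worst_violation = check_constants()
```

The second and fourth inequalities hold with equality at $\delta = 1/2$, so a small floating-point tolerance is needed when interpreting `worst_violation`.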

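Lemma 11. Let $\psi : \mathbb{R} \to \mathbb{R}$ be any continuous, twice-differentiable, convex function such that for
some $a \in \mathbb{R}$, $\psi$ is non-decreasing on $[a, \infty)$ and $\psi'(a) = 0$. Define $f : \mathbb{R} \to \mathbb{R}$ by

\[
f(x) = \begin{cases} \psi(x) & \text{if } x \ge a, \\ \psi(a) & \text{if } x < a. \end{cases}
\]

Then for any $x_0, x \in \mathbb{R}$,

\[
f'(x_0)(x - x_0) \le f(x) - f(x_0) \le f'(x_0)(x - x_0) + \frac{\psi''(\xi)}{2}(x - x_0)^2
\]

for some $\min\{x_0, x\} \le \xi \le \max\{x_0, x\}$.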



Proof. The lower bound follows from the convexity of $f$, which is inherited from the convexity of
$\psi$. For the upper bound, we first consider the case $x_0 < a$ and $x \ge a$. Then, for some $\xi \in [a, x]$,

\begin{align}
f(x) - f(x_0) &= \psi(x) - \psi(a) \notag \\
&= \psi'(a)(x - a) + \frac{\psi''(\xi)}{2}(x - a)^2 \tag{12} \\
&\le \psi'(a)(x - x_0) + \frac{\psi''(\xi)}{2}(x - x_0)^2 \tag{13} \\
&= f'(x_0)(x - x_0) + \frac{\psi''(\xi)}{2}(x - x_0)^2, \notag
\end{align}

where (12) follows by Taylor's theorem and (13) follows since $x_0 < a \le x$, $\psi'(a) \ge 0$, and
$\psi''(\xi) \ge 0$. The case $x_0 \ge a$ and $x < a$ is analogous, and the remaining cases are immediate using
Taylor's theorem. □
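\[
\phi(x, c) = \exp\left( \frac{([x]_+)^2}{2c} \right)
\]
\[
\frac{\partial}{\partial x} \phi(x, c) = \frac{[x]_+}{c} \exp\left( \frac{([x]_+)^2}{2c} \right)
\]
\[
\frac{\partial^2}{\partial x^2} \phi(x, c) = \begin{cases} \left( \frac{1}{c} + \frac{x^2}{c^2} \right) \exp\left( \frac{x^2}{2c} \right) & \text{if } x > 0, \\ 0 & \text{if } x < 0 \end{cases}
\]
\[
\frac{\partial}{\partial c} \phi(x, c) = -\frac{([x]_+)^2}{2c^2} \exp\left( \frac{([x]_+)^2}{2c} \right)
\]

Figure 4: The potential function $\phi(x, c)$ and its derivatives.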
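
The closed forms in Figure 4 can be compared against finite differences. The following Python sketch is a minimal illustration: the function names, the evaluation point, and the step size are assumptions, and the value returned for the second derivative at exactly $x = 0$ (where the first derivative has a kink) is a choice made here, not something taken from the figure.

```python
import math

def phi(x, c):
    """Potential exp(([x]_+)^2 / (2c))."""
    return math.exp(max(x, 0.0) ** 2 / (2.0 * c))

def dphi_dx(x, c):
    """First derivative in x: ([x]_+ / c) * phi."""
    return (max(x, 0.0) / c) * phi(x, c)

def d2phi_dx2(x, c):
    """Second derivative in x; returns 0.0 for x <= 0 (x = 0 is an assumption of this sketch)."""
    if x <= 0.0:
        return 0.0
    return (1.0 / c + x * x / (c * c)) * phi(x, c)

def dphi_dc(x, c):
    """Derivative in c: -(([x]_+)^2 / (2 c^2)) * phi."""
    xp = max(x, 0.0)
    return -(xp * xp / (2.0 * c * c)) * phi(x, c)

def central_diff(f, z, h=1e-5):
    """Symmetric finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

x, c = 1.3, 2.0   # arbitrary test point with x > 0
err_dx = abs(central_diff(lambda z: phi(z, c), x) - dphi_dx(x, c))
err_dxx = abs(central_diff(lambda z: dphi_dx(z, c), x) - d2phi_dx2(x, c))
err_dc = abs(central_diff(lambda z: phi(x, z), c) - dphi_dc(x, c))
```

All three discrepancies should be close to zero, and every derivative vanishes for negative $x$ because the potential is constant there.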